GroupNormFusion
=================
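Performs fused group normalization (Group Normalization Fusion) on the input tensor. In a multi-core environment the work is split by batch, so the mean, variance, and normalization are computed in parallel.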

    .. math::

        \hat{x}_{(b,u,c)} = \frac{x_{(b,u,c)} - \mu_{(b,g)}}{\sqrt{\sigma^2_{(b,g)} + \epsilon}}, \quad y_{(b,u,c)} = \hat{x}_{(b,u,c)} \cdot \mathrm{scale}_c + \mathrm{offset}_c

    where :math:`(b,u,c)` are the batch, spatial-position, and channel indices, and :math:`g` is the group that channel :math:`c` belongs to.
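
    The formula above uses :math:`\mu_{(b,g)}` and :math:`\sigma^2_{(b,g)}` without defining them. A common convention, assumed here rather than confirmed by this page, is the biased (population) mean and variance over all units and all channels of the group. The symbols :math:`U` (``unit``), :math:`C` (``channel``), and :math:`G` (``num_groups``) are introduced only for this sketch, and channels are assumed to be assigned to groups contiguously, :math:`g = \lfloor c / (C/G) \rfloor`:

    .. math::

        \mu_{(b,g)} = \frac{G}{U \cdot C} \sum_{u=0}^{U-1} \sum_{c \in g} x_{(b,u,c)}, \quad \sigma^2_{(b,g)} = \frac{G}{U \cdot C} \sum_{u=0}^{U-1} \sum_{c \in g} \left( x_{(b,u,c)} - \mu_{(b,g)} \right)^2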

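    Inputs:
        - **input** - Base address of the input tensor, shape ``[batch, unit, channel]``.
        - **scale** - Base address of the per-channel scale coefficients, length ``channel``.
        - **offset** - Base address of the per-channel offset coefficients, length ``channel``.
        - **mean** - Base address of the per-batch, per-group mean buffer, length ``batch * num_groups``.
        - **variance** - Base address of the per-batch, per-group variance buffer, length ``batch * num_groups``.
        - **epsilon** - Numerical-stability term added to the variance before the square root.
        - **num_groups** - Number of groups.
        - **channel** - Total number of channels.
        - **unit** - Number of normalization units per batch (H×W).
        - **batch** - Number of batches.
        - **core_mask(int, optional)** - Core mask (shared-storage version only).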

    Outputs:
        - **output** - Base address of the tensor that receives the group-normalization result.
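
    To cross-check results on a host machine, the computation can be sketched in portable C. This is a minimal scalar reference under stated assumptions, not the library's implementation: the name ``groupnorm_ref`` is hypothetical, the tensor is assumed to be row-major ``[batch, unit, channel]``, ``channel`` is assumed to be divisible by ``num_groups`` with contiguous channel-to-group assignment, and the variance is assumed to be the biased (population) variance.

    .. code-block:: c

        #include <math.h>
        #include <stddef.h>

        /* Hypothetical host-side reference; not part of the library. */
        static void groupnorm_ref(const float *input, const float *scale,
                                  const float *offset, float *mean,
                                  float *variance, float epsilon,
                                  int num_groups, int channel, int unit,
                                  int batch, float *output)
        {
            int cpg = channel / num_groups;      /* channels per group (assumed exact) */
            float count = (float)(unit * cpg);   /* elements reduced per (batch, group) */

            for (int b = 0; b < batch; ++b) {
                const float *x = input + (size_t)b * unit * channel;
                float *y = output + (size_t)b * unit * channel;

                for (int g = 0; g < num_groups; ++g) {
                    int c0 = g * cpg, c1 = c0 + cpg;

                    /* Pass 1: mean over all units and the group's channels. */
                    float sum = 0.0f;
                    for (int u = 0; u < unit; ++u)
                        for (int c = c0; c < c1; ++c)
                            sum += x[u * channel + c];
                    float mu = sum / count;

                    /* Pass 2: biased variance around that mean. */
                    float sq = 0.0f;
                    for (int u = 0; u < unit; ++u)
                        for (int c = c0; c < c1; ++c) {
                            float d = x[u * channel + c] - mu;
                            sq += d * d;
                        }
                    float var = sq / count;

                    mean[b * num_groups + g] = mu;
                    variance[b * num_groups + g] = var;

                    /* Pass 3: normalize, then apply per-channel scale and offset. */
                    float inv_std = 1.0f / sqrtf(var + epsilon);
                    for (int u = 0; u < unit; ++u)
                        for (int c = c0; c < c1; ++c)
                            y[u * channel + c] =
                                (x[u * channel + c] - mu) * inv_std * scale[c] + offset[c];
                }
            }
        }

    The two-pass variance is chosen for numerical robustness in a reference; a fused kernel may well compute the statistics differently.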

    Supported platforms:
        ``FT78NE``
        ``MT7004``

    .. note::
        - FT78NE supports the fp32 data type.
        - MT7004 supports the fp16 and fp32 data types.


**Shared-storage version:**

.. c:function:: void hp_groupnormfusion_s(const half *input, const half *scale, const half *offset, half *mean, half *variance, float epsilon, int num_groups, int channel, int unit, int batch, int core_mask, half *output)
.. c:function:: void fp_groupnormfusion_s(const float *input, const float *scale, const float *offset, float *mean, float *variance, float epsilon, int num_groups, int channel, int unit, int batch, int core_mask, float *output)

    **C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 17

        // FT78NE multi-core example
        #include <stdio.h>

        int main(void) {
            const float *input = (const float *)0xA0000000;     // DDR storage
            const float *scale = (const float *)0xB0000000;
            const float *offset = (const float *)0xB0001000;
            float *mean = (float *)0xB0002000;
            float *variance = (float *)0xB0003000;
            float *output = (float *)0xC0000000;
            int num_groups = 8;
            int channel = 64;
            int unit = 49;
            int batch = 32;
            float epsilon = 1e-5f;
            int core_mask = 0xff;
            fp_groupnormfusion_s(input, scale, offset, mean, variance,
                                 epsilon, num_groups, channel, unit,
                                 batch, core_mask, output);
            return 0;
        }
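
    The description states that work is split by batch across cores, but this page does not document how ``core_mask`` is interpreted or how batches are assigned. The sketch below shows one plausible scheme for reasoning about load balance only: it assumes bit *i* of the mask enables core *i* and that each enabled core receives a contiguous batch range. The helper names ``enabled_cores`` and ``batch_range`` are hypothetical and are not library functions.

    .. code-block:: c

        /* Number of cores enabled by the mask (assumption: bit i = core i). */
        static int enabled_cores(unsigned mask)
        {
            int n = 0;
            while (mask) {
                n += (int)(mask & 1u);
                mask >>= 1;
            }
            return n;
        }

        /* Contiguous batch range [*begin, *end) for the rank-th enabled core.
         * The first (batch % cores) cores take one extra batch each. */
        static void batch_range(int batch, int cores, int rank,
                                int *begin, int *end)
        {
            int base = batch / cores;
            int rem = batch % cores;
            *begin = rank * base + (rank < rem ? rank : rem);
            *end = *begin + base + (rank < rem ? 1 : 0);
        }

    With ``batch = 32`` and ``core_mask = 0xff`` as in the example above, this scheme would give each of the eight enabled cores four consecutive batches.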


**Private-storage version:**

.. c:function:: void hp_groupnormfusion_p(const half *input, const half *scale, const half *offset, half *mean, half *variance, float epsilon, int num_groups, int channel, int unit, int batch, half *output)
.. c:function:: void fp_groupnormfusion_p(const float *input, const float *scale, const float *offset, float *mean, float *variance, float epsilon, int num_groups, int channel, int unit, int batch, float *output)

    **C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 16

        // MT7004 single-core example
        #include <stdio.h>

        int main(void) {
            const half *input = (const half *)0x10000000;       // L2 storage; 16*36*32 halves = 0x9000 bytes
            const half *scale = (const half *)0x10010000;       // placed past the end of input
            const half *offset = (const half *)0x10014000;
            half *mean = (half *)0x10018000;
            half *variance = (half *)0x1001C000;
            half *output = (half *)0x10020000;
            int num_groups = 4;
            int channel = 32;
            int unit = 36;
            int batch = 16;
            float epsilon = 1e-4f;
            hp_groupnormfusion_p(input, scale, offset, mean, variance,
                                 epsilon, num_groups, channel, unit,
                                 batch, output);
            return 0;
        }
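
    Because the examples place buffers at fixed addresses, it is worth confirming that each buffer ends before the next one begins. The sketch below derives byte sizes from the shapes given in the parameter list (``batch * unit * channel`` elements for input and output, ``batch * num_groups`` for mean and variance). The helper names are hypothetical and the element size is supplied by the caller (2 bytes for fp16, 4 bytes for fp32).

    .. code-block:: c

        #include <stddef.h>
        #include <stdint.h>

        /* Bytes occupied by input or output: batch * unit * channel elements. */
        static size_t tensor_bytes(int batch, int unit, int channel,
                                   size_t elem_size)
        {
            return (size_t)batch * (size_t)unit * (size_t)channel * elem_size;
        }

        /* Bytes occupied by mean or variance: batch * num_groups elements. */
        static size_t stats_bytes(int batch, int num_groups, size_t elem_size)
        {
            return (size_t)batch * (size_t)num_groups * elem_size;
        }

        /* Nonzero if [a, a + a_len) and [b, b + b_len) share at least one byte. */
        static int ranges_overlap(uintptr_t a, size_t a_len,
                                  uintptr_t b, size_t b_len)
        {
            return a < b + b_len && b < a + a_len;
        }

    For fp16 data with ``batch = 16``, ``unit = 36``, and ``channel = 32``, ``tensor_bytes(16, 36, 32, 2)`` gives 36864 bytes (0x9000), so any buffer placed less than 0x9000 bytes after the start of ``input`` would overlap it.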